Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Zhang, Guoqing, Fukuyama, Keita, Kishimoto, Kazumasa, Kuroda, Tomohiro
Summarizing patient clinical notes is vital for reducing documentation burdens, yet manual summarization leaves medical staff struggling. We propose an automatic method using LLMs, but long inputs cause LLMs to lose context, reducing output quality, especially in small models. We used a 7B model, open-calm-7b, enhanced with Naive Bayes Context Extension (NBCE) and a redesigned decoding mechanism that references one sentence at a time, keeping inputs within the 2048-token context window. On 200 samples, our improved model achieved near parity on ROUGE-L metrics with Google's Gemini (over 175B parameters), indicating strong performance with far fewer resources and enhancing the feasibility of automated EMR summarization.
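The dynamic context extension described above builds on Naive Bayes Context Extension, which scores each next token against every context chunk independently and pools the resulting distributions under a conditional-independence assumption. A minimal numpy sketch of that pooling rule (the function names and toy vocabulary are illustrative, and the paper's exact weighting may differ):

```python
import numpy as np

def logsumexp(a, axis=None, keepdims=False):
    """Numerically stable log-sum-exp (numpy-only, avoiding a scipy dependency)."""
    a = np.asarray(a, dtype=float)
    m = np.max(a, axis=axis, keepdims=True)
    s = np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)) + m
    if keepdims:
        return s
    if axis is None:
        return s.item()
    return np.squeeze(s, axis=axis)

def nbce_pool(context_logits, prior_logits):
    """Pool next-token logits from several independent context chunks with the
    naive-Bayes rule:  log p(x|c_1..c_n) ∝ sum_i log p(x|c_i) - (n-1) log p(x).
    context_logits: (n_contexts, vocab); prior_logits: (vocab,) from the
    context-free prompt. Returns a normalized log-distribution over the vocab."""
    context_logits = np.asarray(context_logits, dtype=float)
    n = context_logits.shape[0]
    log_p_ctx = context_logits - logsumexp(context_logits, axis=-1, keepdims=True)
    log_p_prior = prior_logits - logsumexp(prior_logits)
    pooled = log_p_ctx.sum(axis=0) - (n - 1) * log_p_prior
    return pooled - logsumexp(pooled)  # renormalize

# Toy example: 3 sentence-contexts over a vocabulary of 5 tokens.
rng = np.random.default_rng(0)
ctx = rng.normal(size=(3, 5))
prior = rng.normal(size=5)
pooled = nbce_pool(ctx, prior)
```

In a real decoder, each row of `context_logits` would come from one forward pass over a single sentence plus the question, so no pass ever exceeds the 2048-token window.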
Design and evaluation of AI copilots -- case studies of retail copilot templates
Furmakiewicz, Michal, Liu, Chang, Taylor, Angus, Venger, Ilya
Building a successful AI copilot requires a systematic approach. This paper is divided into two sections, covering the design and evaluation of a copilot respectively. A case study of developing copilot templates for the retail domain by Microsoft is used to illustrate the role and importance of each aspect. The first section explores the key technical components of a copilot's architecture, including the LLM, plugins for knowledge retrieval and actions, orchestration, system prompts, and responsible AI guardrails. The second section discusses testing and evaluation as a principled way to promote desired outcomes and manage unintended consequences when using AI in a business context. We discuss how to measure and improve a copilot's quality and safety through the lens of an end-to-end human-AI decision loop framework. By providing insights into the anatomy of a copilot and the critical aspects of testing and evaluation, this paper offers concrete evidence that good design and evaluation practices are essential for building effective, human-centered AI assistants.
- North America > United States (0.04)
- Europe > Switzerland (0.04)
- Retail (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.67)
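The architectural components the abstract lists (system prompt, plugins, orchestration, guardrails) can be sketched as a single request loop. Everything below is illustrative: `Plugin`, `check_guardrails`, and the stubbed LLM call are hypothetical names, not Microsoft's actual template API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plugin:
    name: str
    description: str
    run: Callable[[str], str]  # executes the plugin action on an argument string

SYSTEM_PROMPT = "You are a retail copilot. Use a plugin when the user asks about data."

def check_guardrails(text: str) -> bool:
    # Placeholder responsible-AI filter: block an (illustrative) denylist.
    return not any(bad in text.lower() for bad in ("credit card", "password"))

def orchestrate(user_msg: str, plugins: dict, call_llm) -> str:
    """One pass of the copilot loop: guardrail -> LLM decision -> plugin -> guardrail."""
    if not check_guardrails(user_msg):
        return "Request blocked by safety policy."
    # The LLM (stubbed here) decides whether a plugin call is needed.
    decision = call_llm(SYSTEM_PROMPT, user_msg, list(plugins))
    if decision.startswith("PLUGIN:"):
        name, _, arg = decision[len("PLUGIN:"):].partition(" ")
        answer = plugins[name].run(arg)
    else:
        answer = decision
    return answer if check_guardrails(answer) else "Response withheld."

# Toy usage with a stubbed LLM and one knowledge-retrieval plugin.
inventory = Plugin("inventory", "look up stock", lambda sku: f"{sku}: in stock")
stub_llm = lambda sys_prompt, msg, tools: "PLUGIN:inventory sku123"
result = orchestrate("Is sku123 in stock?", {"inventory": inventory}, stub_llm)
```

The point of the sketch is the control flow, not the stubs: guardrails wrap both the user input and the model output, and the orchestrator, not the plugin, owns the decision of what reaches the user.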
User-Centric Evaluation of ChatGPT Capability of Generating R Program Code
This paper reports an evaluation of ChatGPT's capability to generate R programming language code from natural language input. A dataset specially designed for R code generation was constructed with metadata to support scenario-based testing and evaluation across usage scenarios of varying difficulty and program type. The evaluation follows a multiple-attempt process in which the tester tries to complete the code generation task over a number of attempts until a satisfactory solution is obtained, or gives up after a fixed maximum number of attempts. In each attempt the tester formulates a natural language input to ChatGPT based on the previous results and the task to be completed. In addition to the metrics of average number of attempts and average time taken to complete the tasks, the final generated solutions are assessed on a number of quality attributes, including accuracy, completeness, conciseness, readability, well-structuredness, logic clarity, depth of explanation, and coverage of parameters. Our experiments demonstrate that ChatGPT is in general highly capable of generating high-quality R program code as well as textual explanations, although it may fail on hard programming tasks. The experiment data also show that human developers can hardly learn from experience to improve their skill at using ChatGPT to generate code.
- Research Report > Experimental Study (0.67)
- Research Report > New Finding (0.66)
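The multiple-attempt protocol above yields per-task records that are aggregated into the paper's headline metrics. A small sketch of that aggregation, where the `Trial` fields and sample values are assumptions for illustration, not the paper's data:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    task_id: str
    attempts: int    # attempts used before success, or before giving up at the cap
    seconds: float   # wall-clock time spent on the task
    solved: bool     # did the tester reach a satisfactory solution?

def summarize(trials):
    """Compute the success rate plus the average-attempts and average-time metrics."""
    solved = [t for t in trials if t.solved]
    return {
        "success_rate": len(solved) / len(trials),
        "avg_attempts": mean(t.attempts for t in trials),
        "avg_seconds": mean(t.seconds for t in trials),
    }

# Hypothetical records: two solved tasks and one give-up at the attempt cap.
trials = [
    Trial("t1", attempts=1, seconds=30.0, solved=True),
    Trial("t2", attempts=5, seconds=200.0, solved=False),
    Trial("t3", attempts=2, seconds=60.0, solved=True),
]
stats = summarize(trials)
```

The quality attributes (accuracy, completeness, and so on) would be scored per final solution and averaged the same way.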
Structured Machine Learning for 'Soft' Classification with Smoothing Spline ANOVA and Stacked Tuning, Testing and Evaluation
We describe the use of smoothing spline analysis of variance (SS-ANOVA) in the penalized log likelihood context, for learning (estimating) the probability p of a '1' outcome, given a training set with attribute vectors and outcomes. The smoothing parameters governing f are obtained by an iterative unbiased risk or iterative GCV method. Confidence intervals for these estimates are available. In medical risk factor analysis, records of attribute vectors and outcomes (0 or 1) for each example (patient) for n examples are available as training data.
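For reference, the objective this abstract refers to has the standard penalized-likelihood form, assuming (as is conventional in the SS-ANOVA literature, though not stated in the abstract) that f is the logit of p and J is a roughness penalty:

```latex
% p is modeled through the logit f:
p(t) = \frac{e^{f(t)}}{1 + e^{f(t)}}
% and the estimate of f minimizes the penalized negative log likelihood
I_\lambda(f) = -\frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ y_i f(t_i) - \log\bigl(1 + e^{f(t_i)}\bigr) \Bigr]
  + \lambda\, J(f)
```

The smoothing parameter λ (and its per-component analogues in the ANOVA decomposition of f) is what the iterative unbiased risk or GCV procedure selects.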
aishield.STRAITE - AI Security Solutions for Testing and Evaluation
At AIShield, we have had an impressive year of growth and achievement. Our team has consistently demonstrated adaptability and a focus on pivoting when necessary, allowing us to make significant progress in our product, business, and team. To drive this progress, we have implemented several strategic initiatives: an API-first product, targeting key industries, offering free product trials, hosting and launching our product on AWS, building demos, releasing a white paper, enabling free security assessments, deploying defenses across the multi-cloud-to-edge continuum, and providing reference implementations with a Python SDK. These efforts have helped us attract and serve many customers and have laid a strong foundation for our business moving forward. Our focus on AI security has enabled us to develop innovative technology that sets us apart from the competition.
How the DOD is developing its AI ethics guidance - FedScoop
It has been six months since the Department of Defense adopted ethical principles for artificial intelligence. Since then, the department's Joint AI Center has faced the daunting challenge of taking that conceptual work and scaling it to develop actionable guidance for the rest of the military. The goal is to give anyone who works in technology development -- from contracting officers to software developers -- a "shared vocabulary" for building ethics into any DOD work involving AI. What's at stake, leaders say, is ensuring that the DOD uses the emerging technology in ways that uphold the department's values while managing potentially huge shifts in the "character" of warfare. The first step is to agree on a document that turns the principles into clear guidance.
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)